fix(graph): introduce Status.SKIPPED for cancel_node; graph continues downstream#2258
fix(graph): introduce Status.SKIPPED for cancel_node; graph continues downstream#2258Zelys-DFKH wants to merge 7 commits into
Conversation
|
Thanks for working on this @Zelys-DFKH — this addresses a real pain point. I filed the original issue (#2240) and have been thinking about the design here, especially considering future graph looping support (revisiting previous nodes via edge conditions). Proposal: Introduce two distinct statuses —
|
|
Agreed on SKIPPED vs COMPLETED. Updated; your analysis of the header-leak risk in
SKIPPED nodes live in
CANCELLED (abort-branch semantics) is a separate concern; I'll open an issue. |
|
Opened #2401 to track CANCELLED (abort-branch semantics). It also surfaces the |
14642dc to
593da1d
Compare
|
Do you mind rebasing? I am asking Strands team to review |
333dc08 to
504027b
Compare
| if before_event.cancel_node: | ||
| cancel_message = ( | ||
| before_event.cancel_node if isinstance(before_event.cancel_node, str) else "node cancelled by user" | ||
| skip_value = before_event.skip_node or before_event.cancel_node |
There was a problem hiding this comment.
Issue: The skip_node or cancel_node short-circuit means if both are set, only skip_node's value is used (since it's checked first with or). This interaction isn't documented and could surprise users who set both fields.
Suggestion: Add a brief note in the BeforeNodeCallEvent docstring about precedence, e.g., "If both skip_node and cancel_node are set, skip_node takes precedence." Alternatively, you could log a warning when both are truthy.
There was a problem hiding this comment.
Good catch. Added a note to the skip_node docstring: "Takes precedence over cancel_node when both are truthy."
|
|
||
| total_nodes: int = 0 | ||
| completed_nodes: int = 0 | ||
| skipped_nodes: int = 0 |
There was a problem hiding this comment.
Issue: The GraphResult.completed_nodes field previously counted all settled nodes but now excludes skipped nodes. This is a behavioral change to an existing public field that users may be relying on (e.g., if result.completed_nodes == result.total_nodes: ... to check "all done").
Suggestion: Document this breaking change prominently in the PR description. Users who relied on completed_nodes == total_nodes to mean "graph finished all work" would now need completed_nodes + skipped_nodes == total_nodes (excluding failures/interrupts). Consider whether this warrants a note in a changelog or migration guide.
There was a problem hiding this comment.
The direction is the opposite — skipped nodes are added to completed_nodes (the skip path calls self.state.completed_nodes.add(node)). Before this PR, cancel_node raised RuntimeError, so skip didn't exist as a concept; nothing assumed completed_nodes excluded skipped nodes. Callers that want to distinguish can check node.execution_status == Status.SKIPPED on individual results.
There was a problem hiding this comment.
Correction to my previous reply: GraphResult.completed_nodes (the public int in _build_result) only counts Status.COMPLETED nodes — skipped ones are filtered by the if n.execution_status == Status.COMPLETED guard. The bot's observation about completed_nodes == total_nodes is right for new callers using skip_node: they'd need completed_nodes + skipped_nodes == total_nodes to mean "all work settled." That's what skipped_nodes is for. The backward-compat point holds: cancel_node raised RuntimeError before this PR, so no existing code relied on skip behavior and there's nothing to migrate.
| ) | ||
| @pytest.mark.asyncio | ||
| async def test_graph_cancel_node(cancel_node, cancel_message): | ||
| async def test_graph_cancel_node_deprecated(cancel_node, cancel_message): |
There was a problem hiding this comment.
Issue: The test name says "deprecated" and the docstring says "emits DeprecationWarning", but since the deprecation warning was intentionally removed from the source code (per review feedback from yananym), this test name and docstring are now inaccurate.
Suggestion: If no deprecation warning will be emitted, rename to test_graph_cancel_node_legacy or test_graph_cancel_node_backward_compat and update the docstring to say "cancel_node still works as a backward-compatible alias for skip_node".
There was a problem hiding this comment.
The deprecation warning wasn't removed — BeforeNodeCallEvent.__setattr__ still emits it when cancel_node is assigned a truthy value. The test names are accurate.
|
Assessment: Request Changes The core fix is sound — replacing Review Categories
Good work on the |
06e6cf0 to
c162369
Compare
Replace Status.COMPLETED + RuntimeError with Status.SKIPPED + None for
nodes bypassed via cancel_node. A skipped node never ran and should not
be marked completed; downstream readiness checks treat SKIPPED the same
as COMPLETED by keeping skipped nodes in completed_nodes.
- Add Status.SKIPPED enum value
- NodeResult.result accepts None; to_dict/from_dict round-trip via
{"type": "skipped"}; get_agent_results() returns [] for None
- _build_node_input filters SKIPPED deps before building
dependency_results, so downstream nodes receive only the original
task with no orphaned "From node_x:" header
- _from_dict restores Status.SKIPPED on node.execution_status via
the persisted NodeResult.status rather than blindly setting COMPLETED
- GraphResult gains skipped_nodes counter; completed_nodes now counts
only nodes that actually ran
- reset_on_revisit correctly clears SKIPPED nodes for loop re-entry
(they are already in completed_nodes)
Introduce BeforeNodeCallEvent.skip_node as the preferred API for bypassing a node in graph/swarm hooks. BeforeNodeCallEvent.cancel_node continues to work but now emits a DeprecationWarning at the assignment site via __setattr__, so the warning points to user code rather than framework internals. Introduce MultiAgentNodeSkipEvent (type: multiagent_node_skip) for the skip signal. MultiAgentNodeCancelEvent is kept but documented as reserved for the future abort-branch CANCELLED semantic per strands-agents#2401. Behaviour: - Graph: skip-and-continue — node is settled as SKIPPED, downstream nodes proceed normally. - Swarm: skip-and-abort — existing abort behaviour is unchanged; swarm stops the current run. - Warning fires unconditionally when cancel_node is set truthy, even when skip_node is also set. - Falsy values (False, "") mean "do not skip" on both fields.
…ify semantics The variable name cancel_message was inherited from the old cancel_node API. With skip_node as the primary interface, rename to skip_message throughout. Also add docstrings to GraphResult.completed_nodes and skipped_nodes to document that completed_nodes counts only nodes that ran to completion (not skipped nodes), and add an inline comment explaining why a skipped node in a linear swarm sets Status.FAILED rather than Status.SKIPPED.
…ip semantics Shorten overlong inline comment in swarm.py from 224 to 119 chars (ruff E501). Replace the stale RST deprecated directive on MultiAgentNodeCancelEvent with a plain note that the class is not currently emitted (reserved for issue strands-agents#2401). Add cancel_node alias mention to MultiAgentNodeSkipEvent docstring. Update test_cancel.py: swap multiagent_node_cancel to multiagent_node_skip, remove the RuntimeError expectation from the graph test (graph continues after skip rather than raising), and expect Status.COMPLETED for the graph case.
1ab0a04 to
9198f8e
Compare
Description
When
cancel_nodeis set in aBeforeNodeCallEventhook,_execute_noderaisedRuntimeErrorafter emittingMultiAgentNodeCancelEvent, which propagated through_execute_nodes_paralleland killed the graph. Downstream nodes never ran, the overall status becameFAILED, and interrupt-and-resume workflows had no recovery path.The fix introduces
Status.SKIPPED. The bypassed node recordsresult=Noneandstatus=Status.SKIPPED, lands incompleted_nodesso downstream readiness checks are satisfied, and returns cleanly. UsingStatus.COMPLETEDhere would be wrong: a node that never ran is not completed, and conflating the two makes it impossible to distinguish "ran successfully" from "intentionally bypassed" in loop control and observability tools.Three fixes came alongside:
_build_node_inputfilters SKIPPED deps before buildingdependency_results, so when all upstream deps are skipped the downstream node receives the original task with no orphaned"From node_x:"header_from_dictreads the persistedNodeResult.statusto restoreStatus.SKIPPEDcorrectly on resume, rather than hardcodingStatus.COMPLETEDfor everything incompleted_nodesGraphResultgainsskipped_nodes: int;completed_nodesnow counts only nodes that actually ranAlso opened #2401 for the complementary CANCELLED semantic (abort-branch, downstream does not run). That issue raises the
cancel_node→skip_nodenaming question; if maintainers prefer Option A, the rename can happen in this PR before it merges.Related Issues
Resolves #2240
Documentation PR
N/A
Type of Change
Bug fix
Testing
Ran
hatch run prepare. Updatedtest_graph_cancel_node(parametrized ×2) to assertStatus.SKIPPEDon bothNodeResult.statusandnode.execution_status. Updatedtest_graph_cancel_node_downstream_executesto assert: downstream node executes (step_b.stream_async.assert_called_once()), step_b receives only the original task with no orphaned header,graph_result.skipped_nodes == 1, andgraph_result.completed_nodes == 1.hatch run prepareChecklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.